ML Course Part 1 - Introduction to Machine Learning

Author

Alexandre Bry

1 Definitions

1.0.1 Machine Learning

Machine Learning (ML)

Feeding data into a computer algorithm in order to learn patterns and make predictions in new and different situations.

ML Model

A computer object implementing an ML algorithm, trained on a dataset to perform a given task.

1.1 Categories of ML

1.1.1 Type of dataset

  • Supervised: for each input in the dataset, the expected output is also part of the dataset
  • Unsupervised: for each input in the dataset, the expected output is not part of the dataset
  • Semi-supervised: only a portion of the inputs of the dataset have their expected output in the dataset
  • Reinforcement: there is no predefined dataset, but an environment giving feedback to the model when it takes actions

1.1.2 Type of output

  • Classification: assigning one (or multiple) label(s) chosen from a given list of classes to each element of the input
  • Regression: assigning one (or multiple) value(s) chosen from a continuous set of values to each element of the input
  • Clustering: create categories by grouping together similar inputs

1.2 Dataset

1.2.1 Definition

Dataset

A collection of data used to train, validate and test ML models.

1.2.2 Content

Instance (or sample)

An instance is one individual entry of the dataset.

Feature (or attribute or variable)

A feature is a type of information stored in the dataset about each instance.

Label (or target or output or class)

A label is a piece of information that the model must learn to predict.

Picture the dataset as a table: each row is an instance, each column is a feature, and one (or more) column(s) contain the labels to predict.

1.2.3 Subsets

Dataset subsets

An ML dataset is usually subdivided into three disjoint subsets, each with a distinct role in the training process:

  • Training set: used during training to train the model,
  • Validation set: used during training to assess the generalization capability of the model, tune hyperparameters and prevent overfitting,
  • Test set: used after training to evaluate the performance of the model on new data it has not encountered before.

Study metaphor: the training set is the exercises, the validation set the past years' exams, and the test set the real exam.

2 Overview of ML methods

2.1 Supervised Learning

2.1.1 Linear Regression

Used for predicting continuous values:

  • Simple: \(y = \alpha + \beta x\)
  • Multiple: \(y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n\)
  • Polynomial: \(y = \alpha + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n\)
  • and many others…

Simple Linear Regression[1]

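
The simple and polynomial cases can be fitted in closed form with NumPy's `polyfit`; the data below is made up for illustration (a noisy quadratic):

```python
import numpy as np

# Noisy samples from a known quadratic relationship (made-up data)
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.2, size=x.shape)

# Least-squares fit of a degree-2 polynomial;
# returns coefficients from highest degree to lowest: [beta_2, beta_1, alpha]
coeffs = np.polyfit(x, y, deg=2)
y_hat = np.polyval(coeffs, x)  # predictions of the fitted model
print("Fitted coefficients:", coeffs)
```

The recovered coefficients land close to the true values \((\alpha, \beta_1, \beta_2) = (1, 2, -0.5)\) despite the noise.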
2.1.2 Logistic Regression

Used for classification problems:

  • Binomial: only two possible categories
  • Multinomial: three or more possible categories
  • Ordinal: three or more possible categories which are ordered

Binomial Logistic Regression[2]

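
A minimal binomial example with scikit-learn's `LogisticRegression`; the synthetic dataset and its parameters are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary dataset (binomial case)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression().fit(X, y)

# The model outputs one probability per class through the logistic function
print("Class probabilities for the first sample:", clf.predict_proba(X[:1]))
print("Training accuracy:", clf.score(X, y))
```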
2.1.3 Decision Trees

A tree-like structure used for both classification and regression

Decision Tree

2.1.4 Random Forests

An ensemble method that combines multiple decision trees:

  • Independently train \(B\) trees using:
    • Bagging: each tree is fitted on a random subset of the training set
    • Feature bagging: each split in the decision tree (i.e. each node) is chosen among a random subset of the features
  • Make a prediction by aggregating the individual predictions of the trees (e.g. majority vote for classification, averaging for regression)

2.1.5 Boosting

An ensemble method that combines weak learners (usually decision trees) to form a stronger model:

  • Choose a simple base learner (e.g. small decision trees with a fixed number of leaves)
  • Repeatedly:
    1. Train a new base learner on the weighted training set
    2. Add this new learner to the ensemble
    3. Evaluate the performance of the ensemble
    4. Give more weight in the training set to misclassified data
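
One classical instance of this scheme is AdaBoost; a minimal sketch with scikit-learn on synthetic data (the dataset parameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic classification dataset
X, y = make_classification(n_samples=300, random_state=0)

# AdaBoost's default weak learner is a depth-1 decision tree (a "stump");
# each round reweights the training set toward misclassified samples
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
boosted.fit(X, y)
print("Training accuracy:", boosted.score(X, y))
```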

2.1.6 K-Nearest Neighbors (KNN)

A non-parametric method for classification and regression

K-NN Classification[3]

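
The principle is simple enough to sketch directly in NumPy on toy data: classify a point by majority vote among its \(k\) closest training points.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to each training point
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]             # most frequent label wins

# Toy dataset: two well-separated groups
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # → 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # → 1
```

"Non-parametric" shows here: there is no training step at all, only the stored dataset.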
2.1.7 Naive Bayes

A probabilistic classifier based on Bayes’ theorem.

2.1.8 Support Vector Machines (SVM)

Used for classification and regression, effective in high-dimensional spaces:

  • Separate the feature space using optimal hyperplanes
  • Features are mapped to a higher-dimensional space, where a linear separation can correspond to a non-linear one in the original feature space (the kernel trick)

SVM kernel trick: map features to a higher dimensional space to be able to separate classes using a hyperplane[4]

2.2 Unsupervised Learning

2.2.1 K-Means Clustering

A method for partitioning data into \(k\) clusters:

  • \(k\) must be chosen a priori
  • The principle is to start with random centroids and then iteratively:
    1. Assign each point to its closest centroid
    2. Move each centroid to the mean of the points assigned to it
  • A classical method with many variants

k-Means Clustering convergence process[5]

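
The two iterated steps can be sketched in a few lines of NumPy; this is a simplified version (fixed number of iterations, no convergence test, no handling of empty clusters):

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Start with k random data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 1. Assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = np.argmin(dists, axis=1)
        # 2. Move each centroid to the mean of the points assigned to it
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

# Two well-separated blobs of points
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```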
2.2.2 Hierarchical Clustering

Builds a hierarchy of clusters using either agglomerative or divisive methods:

  • Build a full hierarchy top-down (divisive) or bottom-up (agglomerative)
  • Create any number of clusters by cutting the tree

Hierarchical Clustering[6]

2.2.3 Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

Clustering based on the density of data points:

  • Divides points into four categories: core points (red in the illustration below), directly reachable (yellow), reachable, and outliers (blue)
  • Only two parameters: the radius (\(\epsilon\)) and the minimum number of neighbors for a point to be core (\(min_{pts}\))

DBSCAN Illustration[7]

DBSCAN Result[8]

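
A minimal sketch using scikit-learn's `DBSCAN` on toy data; the `eps` and `min_samples` parameters correspond to \(\epsilon\) and \(min_{pts}\), and the values below are arbitrary:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point that should be flagged as an outlier
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),
               rng.normal(5, 0.3, (30, 2)),
               [[20.0, 20.0]]])

# eps is the radius, min_samples the neighbor count required to be a core point
db = DBSCAN(eps=1.0, min_samples=5).fit(X)
print("Labels found:", set(db.labels_))  # cluster ids; -1 marks outliers
```

Note that, unlike k-means, the number of clusters is not given a priori: DBSCAN discovers it from the density structure.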
2.2.4 Principal Component Analysis (PCA)

Dimensionality reduction technique to project data into lower dimensions:

  • Project data into a space of lower dimension
  • Keep as much variance (so as much information) as possible

PCA Illustration[9]

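
A minimal sketch with scikit-learn's `PCA`, on synthetic 3-D data that varies mostly along a single direction:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 3-D data lying close to a line (one dominant direction of variance)
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (100, 1)) * np.array([[3.0, 2.0, 1.0]]) + rng.normal(0, 0.1, (100, 3))

# Project onto the 2 directions that capture the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Reduced shape:", X_reduced.shape)
print("Explained variance ratios:", pca.explained_variance_ratio_)
```

The first ratio is close to 1 here, confirming that almost all the information survives the projection.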
2.2.5 t-Distributed Stochastic Neighbor Embedding (t-SNE)

A nonlinear dimensionality reduction technique primarily used for visualization of high-dimensional data

t-SNE Result on MNIST dataset[10]

2.3 Reinforcement Learning

2.3.1 Framework

Reinforcement Learning Framework[11]

2.3.2 Definitions

Policy (in RL)

Function that returns an action given the state of the environment.

On-Policy vs. Off-Policy

Refers to which policy is used to compute the update target during training: On-Policy methods use the action actually chosen by the policy being learnt, while Off-Policy methods use a different policy, typically the greedy one with respect to the current value estimates.

2.3.3 Q-Learning / SARSA - Similarities

A value-based reinforcement learning algorithm:

  • Iteratively learn the expected return of each action in each state
  • Limited to discrete and simple environments
  • Neural network variants (like Deep Q-Learning) make it possible to handle more complex environments

Q-Learning Table[12]

2.3.4 Q-Learning / SARSA - Differences

The two methods differ by the estimation of the reward that is used to update the Q-table:

  • Q-Learning (Off-Policy) uses the best possible reward: \(Q(s,a) \leftarrow Q(s,a) + \alpha(r + \gamma \max_{a''} Q(s',a'') - Q(s,a))\)
  • SARSA (On-Policy) uses the reward of the next action: \(Q(s,a) \leftarrow Q(s,a) + \alpha(r + \gamma Q(s',a') - Q(s,a))\)

where:

  • \(a\) is the action taken while the environment is in state \(s\), leading to state \(s'\) where action \(a'\) will be taken
  • \(\alpha\) is the learning rate
  • \(r\) is the reward received after taking action \(a\) in state \(s\) and arriving in state \(s'\)
  • \(\gamma\) is the discount factor defining how much we value long-term rewards relatively to short-term rewards
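
The two update rules translate directly into operations on a NumPy Q-table; the states, actions and reward below are arbitrary toy values:

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # Q-table: expected return per (state, action)
alpha, gamma = 0.1, 0.9               # learning rate and discount factor

def q_learning_update(Q, s, a, r, s_next):
    # Off-policy: bootstrap on the best action available in the next state
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: bootstrap on the action actually taken in the next state
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q[0, 1])  # 0.1 * (1.0 + 0.9 * 0.0 - 0.0) = 0.1
```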

2.3.5 Policy Gradient Methods

Given a model defined with a certain number of parameters, optimize the policy by adjusting these parameters in the direction that improves performance (e.g., REINFORCE algorithm)

2.3.6 Actor-Critic Methods

Combination of an Actor and a Critic learning simultaneously:

  • The Actor learns the policy through a parametrized function
  • The Critic estimates the value of each action and gives feedback to the Actor

3 Usual pipeline

3.0.1 Overview

  1. Data acquisition
  2. Data preprocessing
  3. Model selection
  4. Model evaluation
  5. Final model training

3.0.2 Data acquisition

Gather the data, potentially from multiple different sources. Choosing the right sources can also depend on the choices made in the next steps.

3.1 Data preprocessing

3.1.1 Different issues

Multiple sources of issues and steps to perform:

  1. Handle different formats
  2. Remove outliers (mostly for raw data)
  3. (Optionally) extract features
  4. Handle missing data
  5. Normalize

3.1.2 Why normalization?

Idea

A priori all features have the same importance, so none of them should have an advantage. Therefore, having features with larger values than others would be detrimental.

Usually, all features are individually normalized over the whole dataset, to obtain a distribution with an average of 0 and a standard deviation of 1:

\[ \begin{align*} \hat{X} & = \frac{1}{n} \sum\limits_{j=1}^n X_j \\ \sigma_X & = \sqrt{\frac{1}{n} \sum\limits_{j=1}^n (X_j - \hat{X})^2} \\ \forall k \in \{ 1, \cdots, n \},\ X_k & \leftarrow \frac{X_k - \hat{X}}{\sigma_X} \end{align*} \]
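
With features stored as columns of a NumPy array, this standardization is a one-liner over the feature axis (random data for illustration):

```python
import numpy as np

# Random data: 100 instances with 3 features, not centered and not unit-scaled
rng = np.random.default_rng(0)
X = rng.normal(10, 5, (100, 3))

# Standardize each feature (column) independently: mean 0, standard deviation 1
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print("Means:", X_norm.mean(axis=0))  # ~0 for each feature
print("Stds: ", X_norm.std(axis=0))   # 1 for each feature
```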

3.1.3 Model selection

  • Type of model (ML, NN, DL, …)
  • Complexity:
    • Number of features
    • Type of output
    • Size of the layers (for NN)
    • Number of layers (for NN)
  • Hyperparameters

3.1.4 Model training

  • Loss selection: depends on the task, the objectives, the specific issues to solve
  • Training process selection (lots of different tweaks and improvements can be implemented in NN training)
  • Hyperparameter tuning, by repeatedly:
    • Selecting one or multiple configurations of hyperparameters
    • Training the model one or multiple times
    • Determining the best hyperparameters

3.1.5 Model evaluation - Criteria

Criteria selection among the many possible ones:

  • For classification:
    • Accuracy: for balanced datasets
    • Precision: when false positives are costly
    • Recall: when false negatives are costly
    • F1-Score: when class distribution is unbalanced
  • For regression:
    • Mean Absolute Error (MAE)
    • Mean Square Error (MSE): more sensitive to large errors than MAE
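
All these criteria are available in scikit-learn; a toy example with hand-picked predictions:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_absolute_error, mean_squared_error)

# Classification toy example: 2 true positives, 1 false positive, 2 false negatives
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
print("Accuracy: ", accuracy_score(y_true, y_pred))   # 5 correct out of 8
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) = 2/3
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN) = 2/4
print("F1:       ", f1_score(y_true, y_pred))

# Regression toy example: one large error dominates the MSE but not the MAE
y_true_r = [1.0, 2.0, 3.0]
y_pred_r = [1.0, 2.0, 6.0]
print("MAE:", mean_absolute_error(y_true_r, y_pred_r))  # (0 + 0 + 3) / 3 = 1.0
print("MSE:", mean_squared_error(y_true_r, y_pred_r))   # (0 + 0 + 9) / 3 = 3.0
```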

3.1.6 Model evaluation - Cross-validation

Cross-validation

Method to estimate real performance of the model by:

  1. Splitting the dataset into multiple parts (usually 5)
  2. For different combinations of these parts (usually 5), training and evaluating the model

Cross-validation[13]

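
scikit-learn's `cross_val_score` implements this procedure; a minimal sketch on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: 5 train/evaluate rounds, each holding out a different fold
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

The spread of the fold scores also gives a rough idea of how stable the model's performance is.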
3.1.7 Final model training

Once the data has been preprocessed, the model selected, and the hyperparameters chosen and optimized, the final model can be trained multiple times and the best run kept.

4 Challenges

4.1 Data

4.1.1 Quality

Quality of the data is obviously crucial to train well-performing models. Quality encompasses multiple aspects:

  • Raw data quality: the input data must contain enough detail for the task to be achievable at all. Be careful however, as more features imply larger models, which take longer and are harder to train.
  • Annotations quality: the annotations must be as precise and correct as possible in the context of the task at hand. Every blunder or outlier in a supervised dataset will slow down training and might result in unexpected behaviors of the trained model.

4.1.2 Diversity

Diversity is the most important aspect of a dataset, because ML models are good at interpolating between seen examples but bad at extrapolating to new scenarios. There are different aspects of diversity to keep in mind:

  • A well-defined task is crucial to identify all the various cases that we want the model to handle. Being as exhaustive as possible when selecting the training instances will accelerate training and significantly improve the model.
  • Balancing the dataset can also improve training. When training on imbalanced datasets (i.e. when some cases are much more represented than others), the model will focus on the most represented situations, as this is the easiest and quickest way to get better results. There are ways of correcting this phenomenon, but it is always better to avoid it when building the dataset.

4.1.3 Biases and fairness

Biased

Refers to a model which always makes the same kind of wrong predictions in similar cases.

In practice, a model trained on biased data will most of the time repeat the biased results. This can have major consequences and shouldn’t be underestimated: even a cold-hearted ML algorithm is not objective if it wasn’t trained on objectively chosen and annotated data.

However, there exist model architectures, training and evaluation methods to prevent and detect biases, which can sometimes make it possible to build unbiased models from biased data. But this needs to be well thought out and won't happen unless biases are explicitly accounted for.

4.2 Underfitting and Overfitting

4.2.1 Definitions

Underfitting

When a model is too simple to properly capture the structure of a complex task. Can also be caused by key information missing from the input features.

Overfitting

When a model is too complex to properly generalize to new data. Often happens when an NN is trained too long on a dataset that is not diverse enough and ends up learning the noise in the data.

4.2.2 Illustrations

Underfitting and overfitting on a regression task[14]

Underfitting and overfitting on a classification task[14]

4.2.3 Solutions

Solution             Underfitting   Overfitting
Complexity           Increase       Reduce
Number of features   Increase       Reduce
Regularization       Reduce         Increase
Training time        Increase       Reduce

General strategies:

  • Cross-validation to identify problems
  • Grid search/random search to tune hyperparameters and balance between underfitting and overfitting
  • Ensemble methods to reduce overfitting by using many smaller models instead of one big one
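
A grid search sketch with scikit-learn, tuning a decision tree's depth (too shallow underfits, too deep may overfit); the candidate depth values below are arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Try several tree depths; each candidate is scored by 5-fold cross-validation
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [1, 2, 3, 5, 10]},
                    cv=5)
grid.fit(X, y)
print("Best depth:", grid.best_params_)
print("Best cross-validated accuracy:", grid.best_score_)
```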

4.3 Interpretable and Explainable

4.3.1 Definitions

Interpretable

Qualifies an ML model whose decision-making process is straightforward and transparent, making it directly understandable by humans. This requires restricting the model's complexity.

Explainable

Qualifies an ML model whose decision-making process can be partly interpreted afterwards using post hoc interpretation techniques. These techniques are often used on models that are too complex to be interpretable.

5 Python libraries

5.1 Data manipulation

5.1.1 NumPy - Strengths

  • NumPy
    • Fast numerical operations
    • Matrices with any number of dimensions (called arrays)
    • Lots of convenient operators on arrays

5.1.2 NumPy - Examples

import numpy as np
array = np.array([[0, 1], [2, 3]])
print(array)
[[0 1]
 [2 3]]
print(5 * array)
[[ 0  5]
 [10 15]]
print(np.power(array, 3))
[[ 0  1]
 [ 8 27]]
print(array @ array)
[[ 2  3]
 [ 6 11]]
print(np.where(array < 2, 10 - array, array))
[[10  9]
 [ 2  3]]

Similar functionalities in PyTorch

5.1.3 Pandas - Strengths

  • Pandas
    • Can store any type of data
    • 1D tables (called Series) or 2D tables (called DataFrames)
    • Lots of convenient operators on DataFrames

5.1.4 Pandas - Examples

import pandas as pd
import numpy.random as npr
df = pd.DataFrame([
    ["Pi", 3.14159, npr.randint(-100, 101, (2, 2))],
    ["Euler's number", 2.71828, npr.randint(-100, 101, (2, 2))],
    ["Golden ratio", 1.61803, npr.randint(-100, 101, (2, 2))]
  ], columns = ["Names", "Values", "Random numbers because why not"])
display(df)
Names Values Random numbers because why not
0 Pi 3.14159 [[31, 1], [76, 96]]
1 Euler's number 2.71828 [[25, -9], [-59, 76]]
2 Golden ratio 1.61803 [[39, 76], [95, 90]]
display(df[df["Values"] > 2])
Names Values Random numbers because why not
0 Pi 3.14159 [[31, 1], [76, 96]]
1 Euler's number 2.71828 [[25, -9], [-59, 76]]
display(df[df["Names"].str.contains("n")])
Names Values Random numbers because why not
1 Euler's number 2.71828 [[25, -9], [-59, 76]]
2 Golden ratio 1.61803 [[39, 76], [95, 90]]

5.2 ML

5.2.1 SciPy

Scientific and technical computing based on NumPy. Documentation here

import scipy as sp
import numpy as np
# Define a linear system Ax = b
A = np.array([[3, 2], [1, 4]])
b = np.array([6, 8])

# Solve the system
x = sp.linalg.solve(A, b)
print("Solution to the linear system Ax = b:", x)
Solution to the linear system Ax = b: [0.8 1.8]
# Define a function to integrate
def integrand(x):
    return np.exp(-x**2/2)/np.sqrt(2*np.pi)

# Integrate the function from 0 to infinity
result, error = sp.integrate.quad(integrand, 0, np.inf)
print("Integral of exp(-x**2/2)/sqrt(2*np.pi) from 0 to infinity:", result)
Integral of exp(-x**2/2)/sqrt(2*np.pi) from 0 to infinity: 0.4999999999999999

5.2.2 scikit-learn

A lot of tools for ML (except DL). Documentation here

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression, load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Generate data, create and fit a Linear Regression model
X, y = make_regression(n_samples=100, n_features=1, noise=0.1)
model = LinearRegression()
model.fit(X, y)

print("Coefficient:", model.coef_, "Intercept:", model.intercept_)
Coefficient: [13.25278978] Intercept: 0.01644970591329642
# Load the iris dataset and split it in training and test
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit a random forest classifier on training data, and use it
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
Accuracy: 1.0

5.3 Visualization

5.3.1 Matplotlib

Examples

5.3.2 Plotly

Examples

Interactive charts.

5.3.3 Seaborn

Examples

6 Resources

6.1 Machine Learning

6.2 References

[1] Schutz (derivative work by Avenue). “Simple linear regression.” Available at: https://commons.wikimedia.org/w/index.php?curid=9838454.
[2] Canley. “Binomial logistic regression.” Available at: https://commons.wikimedia.org/w/index.php?curid=116449187.
[3] Antti Ajanki AnAj. “K-NN classification.” Available at: https://commons.wikimedia.org/w/index.php?curid=2170282.
[4] Alisneaky (vectorized by Zirguezi). “SVM kernel trick: Map features to a higher dimensional space to be able to separate classes using a hyperplane.” Available at: https://commons.wikimedia.org/w/index.php?curid=47868867.
[5] Chire. “K-means clustering convergence process.” Available at: https://commons.wikimedia.org/w/index.php?curid=59409335.
[6] Stathis Sideris (derivative work by Mhbrugman). “Hierarchical clustering.” Available at: https://commons.wikimedia.org/w/index.php?curid=7344806.
[7] Chire. “DBSCAN illustration.” Available at: https://commons.wikimedia.org/w/index.php?curid=17045963.
[8] Chire. “DBSCAN result.” Available at: https://commons.wikimedia.org/w/index.php?curid=17085332.
[9] Nicoguaro. “PCA illustration.” Available at: https://commons.wikimedia.org/w/index.php?curid=46871195.
[10] Kyle McDonald. “T-SNE result on MNIST dataset.” Available at: https://commons.wikimedia.org/w/index.php?curid=115726949.
[11] Megajuice. “Reinforcement learning framework.” Available at: https://commons.wikimedia.org/w/index.php?curid=57895741.
[12] LearnDataSci. “Q-learning table.” Available at: https://commons.wikimedia.org/w/index.php?curid=69947708.
[13] Gufosowa. “Cross-validation.” Available at: https://commons.wikimedia.org/w/index.php?curid=82298768.
[14] GeeksforGeeks. “Underfitting and overfitting in machine learning.” Available at: https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/.